# Zero-shot Transfer

**Openvision Vit Base Patch16 160** · Apache-2.0 · UCSC-VLAA · 15 downloads · 0 likes
OpenVision is a fully open-source, cost-effective family of advanced vision encoders for multimodal learning.
Tags: Multimodal Fusion

**Vica2 Stage2 Onevision Ft** · Apache-2.0 · nkkbr · 63 downloads · 0 likes
ViCA2 is a 7B-parameter multimodal vision-language model focused on video understanding and visual-spatial cognition tasks.
Tags: Video-to-Text, Transformers, English

**Blip Custom Captioning** · BSD-3-Clause · hiteshsatwani · 78 downloads · 0 likes
BLIP is a unified vision-language pretraining framework that excels at vision-language tasks such as image caption generation.
Tags: Image-to-Text
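
Several entries in this listing are BLIP captioning checkpoints. As a minimal sketch of how such a model is typically used through the standard Hugging Face BLIP classes, the snippet below assumes the Salesforce/blip-image-captioning-base checkpoint and a placeholder image path; neither comes from this listing.

```python
# Minimal BLIP captioning sketch; checkpoint id and image path are placeholders.
import torch
from PIL import Image
from transformers import BlipProcessor, BlipForConditionalGeneration

ckpt = "Salesforce/blip-image-captioning-base"  # assumed upstream checkpoint
processor = BlipProcessor.from_pretrained(ckpt)
model = BlipForConditionalGeneration.from_pretrained(ckpt)

image = Image.open("photo.jpg").convert("RGB")  # placeholder image

# Unconditional captioning: describe the image from scratch.
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))

# Conditional captioning: a text prefix steers the description.
inputs = processor(images=image, text="a photograph of", return_tensors="pt")
with torch.no_grad():
    out = model.generate(**inputs, max_new_tokens=30)
print(processor.decode(out[0], skip_special_tokens=True))
```
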
**Vit So400m Patch16 Siglip 256.v2 Webli** · Apache-2.0 · timm · 12.56k downloads · 0 likes
SigLIP 2 ViT model containing only the image encoder, for image feature extraction; trained on the WebLI dataset.
Tags: Text-to-Image, Transformers

**Vit So400m Patch14 Siglip 224.v2 Webli** · Apache-2.0 · timm · 7,005 downloads · 0 likes
A Vision Transformer based on the SigLIP 2 architecture, designed for image feature extraction and pretrained on the WebLI dataset.
Tags: Image Classification, Transformers

**Vit Large Patch16 Siglip 384.v2 Webli** · Apache-2.0 · timm · 4,265 downloads · 0 likes
A Vision Transformer based on the SigLIP 2 architecture, designed for image feature extraction and pretrained on the WebLI dataset.
Tags: Text-to-Image, Transformers

**Vit Large Patch16 Siglip 256.v2 Webli** · Apache-2.0 · timm · 525 downloads · 0 likes
A Vision Transformer based on the SigLIP 2 architecture, designed for image feature extraction and trained on the WebLI dataset.
Tags: Image Classification, Transformers

**Vit Giantopt Patch16 Siglip 384.v2 Webli** · Apache-2.0 · timm · 160 downloads · 0 likes
A ViT image encoder based on SigLIP 2, packaged for timm and suitable for vision-language tasks.
Tags: Image Classification, Transformers

**Vit Base Patch16 Siglip Gap 256.v2 Webli** · Apache-2.0 · timm · 114 downloads · 1 like
A ViT image encoder based on SigLIP 2 that uses global average pooling with the attention pooling head removed; suitable for image feature extraction.
Tags: Multimodal Fusion, Transformers

**Vit Base Patch16 Siglip 384.v2 Webli** · Apache-2.0 · timm · 330 downloads · 0 likes
A Vision Transformer based on SigLIP 2, designed for image feature extraction and pretrained on the WebLI dataset.
Tags: Text-to-Image, Transformers

**Vit Base Patch16 Siglip 224.v2 Webli** · Apache-2.0 · timm · 1,992 downloads · 0 likes
A ViT model based on SigLIP 2, focused on image feature extraction and trained on the WebLI dataset.
Tags: Text-to-Image, Transformers
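
The SigLIP 2 entries above are timm image encoders intended for feature extraction. A minimal sketch of the usual timm loading pattern follows; the lower-case timm model id and the image path are assumptions for illustration.

```python
# Minimal feature-extraction sketch for a timm SigLIP 2 image encoder.
import timm
import torch
from PIL import Image

model = timm.create_model(
    "vit_base_patch16_siglip_224.v2_webli",  # assumed timm id for one entry above
    pretrained=True,
    num_classes=0,  # drop the head and return pooled image features
)
model.eval()

# Preprocessing that matches the pretrained weights.
cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

image = Image.open("example.jpg").convert("RGB")  # placeholder path
with torch.no_grad():
    features = model(transform(image).unsqueeze(0))  # shape: (1, embed_dim)
print(features.shape)
```
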
**Blip Image Captioning Large** · BSD-3-Clause · drgary · 23 downloads · 1 like
A vision-language model pretrained on the COCO dataset, excelling at generating accurate image descriptions.
Tags: Image-to-Text

**Convnext Large Mlp.clip Laion2b Ft Soup 320** · Apache-2.0 · timm · 173 downloads · 0 likes
A ConvNeXt-Large image encoder based on the CLIP architecture, fine-tuned on the LAION-2B dataset; supports image feature extraction at 320x320 resolution.
Tags: Image Classification, Transformers

**Convnext Large Mlp.clip Laion2b Augreg** · Apache-2.0 · timm · 107 downloads · 0 likes
A ConvNeXt-Large image encoder based on the CLIP framework, trained on the LAION-2B dataset; supports visual feature extraction.
Tags: Image Classification, Transformers
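
For the ConvNeXt CLIP encoders above, timm can also return per-stage spatial feature maps instead of a single pooled vector. The sketch below assumes the lower-case timm id and a placeholder image.

```python
# Minimal sketch: per-stage feature maps from a timm ConvNeXt CLIP encoder.
import timm
import torch
from PIL import Image

model = timm.create_model(
    "convnext_large_mlp.clip_laion2b_augreg",  # assumed timm id for the entry above
    pretrained=True,
    features_only=True,  # return intermediate feature maps instead of a pooled vector
)
model.eval()

cfg = timm.data.resolve_model_data_config(model)
transform = timm.data.create_transform(**cfg, is_training=False)

image = Image.open("example.jpg").convert("RGB")  # placeholder path
with torch.no_grad():
    feature_maps = model(transform(image).unsqueeze(0))
for fmap in feature_maps:
    print(fmap.shape)  # one (1, C, H, W) tensor per stage
```
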
**Cogact Small** · MIT · CogACT · 405 downloads · 4 likes
CogACT is a novel Vision-Language-Action (VLA) architecture derived from vision-language models (VLMs), designed specifically for robot manipulation.
Tags: Multimodal Fusion, Transformers, English

**Cogact Large** · MIT · CogACT · 122 downloads · 3 likes
CogACT is a novel Vision-Language-Action (VLA) architecture derived from vision-language models (VLMs), designed specifically for robot manipulation.
Tags: Multimodal Fusion, Transformers, English

**Cogact Base** · MIT · CogACT · 6,589 downloads · 12 likes
CogACT is a novel Vision-Language-Action (VLA) architecture that combines vision-language models with specialized action modules for robotic manipulation tasks.
Tags: Multimodal Fusion, Transformers, English

**Aimv2 Large Patch14 Native Image Classification** · MIT · amaye15 · 15 downloads · 2 likes
AIMv2-Large-Patch14-Native adapted for image classification, modified from the original AIMv2 model to be compatible with Hugging Face Transformers' AutoModelForImageClassification class.
Tags: Image Classification, Transformers
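
The AIMv2 entry above is described as compatible with AutoModelForImageClassification. A generic sketch of that workflow follows; the repo id is only a guess based on the listing's author and name, and the trust_remote_code flag is an assumption.

```python
# Generic image-classification sketch; repo id and image path are assumptions.
import torch
from PIL import Image
from transformers import AutoImageProcessor, AutoModelForImageClassification

repo = "amaye15/aimv2-large-patch14-native-image-classification"  # assumed id
processor = AutoImageProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForImageClassification.from_pretrained(repo, trust_remote_code=True)

image = Image.open("cat.jpg").convert("RGB")  # placeholder image
inputs = processor(images=image, return_tensors="pt")
with torch.no_grad():
    logits = model(**inputs).logits
print(model.config.id2label[logits.argmax(-1).item()])
```
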
**Sam2.1 Hiera Small** · Apache-2.0 · facebook · 7,333 downloads · 6 likes
SAM 2 is a foundation model for promptable visual segmentation in images and videos, developed by FAIR; it supports efficient segmentation through prompts.
Tags: Image Segmentation

**Sam2.1 Hiera Large** · Apache-2.0 · facebook · 203.27k downloads · 81 likes
SAM 2 is a foundation model for promptable visual segmentation in images and videos, developed by FAIR; it supports general-purpose segmentation through prompts.
Tags: Image Segmentation

**Sam2 Hiera Base Plus** · Apache-2.0 · facebook · 18.17k downloads · 6 likes
SAM 2 is a foundation model for promptable visual segmentation in images and videos, developed by FAIR; it supports efficient segmentation through prompts.
Tags: Image Segmentation
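
The SAM 2 entries above are prompted at inference time with points or boxes. The sketch below uses the image predictor API from the sam2 package (facebookresearch/sam2 repository); the checkpoint id, image path, and point prompt are placeholder assumptions.

```python
# Minimal promptable-segmentation sketch with the sam2 package.
import numpy as np
import torch
from PIL import Image
from sam2.sam2_image_predictor import SAM2ImagePredictor

predictor = SAM2ImagePredictor.from_pretrained("facebook/sam2.1-hiera-large")  # assumed id

image = np.array(Image.open("scene.jpg").convert("RGB"))  # placeholder image
with torch.inference_mode():
    predictor.set_image(image)
    # One foreground click at pixel (x=500, y=375).
    masks, scores, _ = predictor.predict(
        point_coords=np.array([[500, 375]]),
        point_labels=np.array([1]),
    )
print(masks.shape, scores)
```
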
**Cogflorence 2.1 Large** · MIT · thwri · 2,541 downloads · 22 likes
A fine-tuned version of microsoft/Florence-2-large, trained on a 40,000-image subset of the Ejafa/ye-pop dataset with annotations generated by THUDM/cogvlm2-llama3-chat-19B; focused on image-to-text tasks.
Tags: Image-to-Text, Transformers, Supports Multiple Languages
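
Florence-2 derivatives like the entry above are usually driven with task-prompt tokens. This sketch follows the commonly documented Florence-2 usage pattern; the thwri/CogFlorence-2.1-Large repo id and the image path are assumptions.

```python
# Minimal Florence-2-style captioning sketch; repo id and image are placeholders.
import torch
from PIL import Image
from transformers import AutoModelForCausalLM, AutoProcessor

repo = "thwri/CogFlorence-2.1-Large"  # assumed id for the entry above
processor = AutoProcessor.from_pretrained(repo, trust_remote_code=True)
model = AutoModelForCausalLM.from_pretrained(repo, trust_remote_code=True)

image = Image.open("street.jpg").convert("RGB")  # placeholder image
task = "<MORE_DETAILED_CAPTION>"  # Florence-2 task-prompt token
inputs = processor(text=task, images=image, return_tensors="pt")

with torch.no_grad():
    generated_ids = model.generate(
        input_ids=inputs["input_ids"],
        pixel_values=inputs["pixel_values"],
        max_new_tokens=256,
        num_beams=3,
    )
raw = processor.batch_decode(generated_ids, skip_special_tokens=False)[0]
parsed = processor.post_process_generation(raw, task=task, image_size=(image.width, image.height))
print(parsed[task])
```
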
**RADIO L** · nvidia · 23.27k downloads · 8 likes
AM-RADIO is a visual foundation model developed by NVIDIA Research; its aggregated architecture provides a unified multi-domain representation suitable for a variety of computer vision tasks.
Tags: Image Segmentation, Transformers

**RADIO B** · nvidia · 999 downloads · 3 likes
RADIO is a vision foundation model developed by NVIDIA Research, capable of unifying visual information across domains for a variety of vision tasks.
Tags: Image Segmentation, Transformers

**Cogflorence 2 Large Freeze** · MIT · thwri · 419 downloads · 14 likes
A fine-tuned version of microsoft/Florence-2-large, trained on a 38,000-image subset of the Ejafa/ye-pop dataset with CogVLM2-generated annotations; focused on image-to-text tasks.
Tags: Image-to-Text, Transformers, Supports Multiple Languages

**Emotion LLaMA** · Apache-2.0 · ZebangCheng · 213 downloads · 4 likes
Released under the Apache-2.0 license; no further details are provided in the listing.
Tags: Large Language Model, Transformers

**Fashion Embedder** · MIT · McClain · 58 downloads · 0 likes
FashionCLIP is a CLIP-based vision-language model fine-tuned for the fashion domain, capable of generating general-purpose representations of fashion products.
Tags: Text-to-Image, Transformers, English
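
FashionCLIP-style encoders such as the Fashion Embedder entry are typically queried through the standard CLIP classes. In the sketch below, the patrickjohncyh/fashion-clip checkpoint (the commonly referenced upstream FashionCLIP) and the inputs are illustrative assumptions, not taken from this listing.

```python
# Minimal CLIP-style text/image matching sketch for a fashion encoder.
import torch
from PIL import Image
from transformers import CLIPModel, CLIPProcessor

ckpt = "patrickjohncyh/fashion-clip"  # assumed upstream FashionCLIP checkpoint
model = CLIPModel.from_pretrained(ckpt)
processor = CLIPProcessor.from_pretrained(ckpt)

image = Image.open("dress.jpg").convert("RGB")  # placeholder product photo
texts = ["a red evening dress", "a pair of running shoes"]

inputs = processor(text=texts, images=image, return_tensors="pt", padding=True)
with torch.no_grad():
    outputs = model(**inputs)

# Image embedding for retrieval, plus image-text similarity scores.
print(outputs.image_embeds.shape)
print(outputs.logits_per_image.softmax(dim=-1))
```
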
**Chronos T5 Small** · Apache-2.0 · autogluon · 54.04k downloads · 5 likes
Chronos-T5 is a pretrained time series forecasting model built on a language-model architecture. It converts time series into token sequences through scaling and quantization for training, making it suitable for a wide range of forecasting tasks.
Tags: Climate Model, Transformers
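
Chronos-T5 checkpoints are normally used through the chronos-forecasting package rather than raw transformers. The sketch below assumes that package; the repo id follows the listing's author and name, and the series values are made-up toy data.

```python
# Minimal probabilistic-forecasting sketch with the chronos-forecasting package.
import torch
from chronos import ChronosPipeline

pipeline = ChronosPipeline.from_pretrained("autogluon/chronos-t5-small")  # assumed id

# Toy univariate history; a real use case would pass an actual time series.
context = torch.tensor([112.0, 118.0, 132.0, 129.0, 121.0, 135.0, 148.0, 148.0])
forecast = pipeline.predict(context, prediction_length=12)  # (1, num_samples, 12)

# Point forecast as the median over the sampled trajectories.
median = forecast[0].quantile(0.5, dim=0)
print(median)
```
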
**Zoedepth Kitti** · MIT · Intel · 7,037 downloads · 2 likes
ZoeDepth is a vision model for monocular depth estimation, fine-tuned on the KITTI dataset and capable of zero-shot transfer for metric depth estimation.
Tags: 3D Vision, Transformers
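
ZoeDepth is integrated into transformers, so the entry above can be exercised through the depth-estimation pipeline. The Intel/zoedepth-kitti repo id and the image path below are assumptions based on the listing.

```python
# Minimal monocular depth-estimation sketch via the transformers pipeline.
from PIL import Image
from transformers import pipeline

depth_estimator = pipeline("depth-estimation", model="Intel/zoedepth-kitti")  # assumed id

image = Image.open("road.jpg").convert("RGB")  # placeholder driving image
result = depth_estimator(image)

depth_image = result["depth"]             # PIL image, convenient for visualization
depth_tensor = result["predicted_depth"]  # raw predicted depth values
print(depth_tensor.shape)
```
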
**Web Register Classification Multilingual** · Apache-2.0 · TurkuNLP · 106 downloads · 3 likes
A multilingual web register classifier based on fine-tuned XLM-RoBERTa-large, supporting text classification in 100 languages.
Tags: Text Classification, Transformers, Supports Multiple Languages
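
The TurkuNLP entry above is a fine-tuned XLM-RoBERTa sequence classifier. The sketch below guesses the repo id from the listing and assumes a multi-label sigmoid readout, which is common for register classification but should be checked against the model card.

```python
# Generic multi-label text-classification sketch; repo id and threshold are assumptions.
import torch
from transformers import AutoModelForSequenceClassification, AutoTokenizer

repo = "TurkuNLP/web-register-classification-multilingual"  # assumed id
tokenizer = AutoTokenizer.from_pretrained(repo)
model = AutoModelForSequenceClassification.from_pretrained(repo)

text = "Mix two cups of flour with one cup of water and knead until smooth."
inputs = tokenizer(text, return_tensors="pt", truncation=True)
with torch.no_grad():
    logits = model(**inputs).logits

probs = torch.sigmoid(logits)[0]  # assumed multi-label readout
for label_id, p in enumerate(probs):
    if p > 0.5:
        print(model.config.id2label[label_id], round(float(p), 3))
```
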
**Nllb Uzbek Russian** · Apache-2.0 · sarahai · 54 downloads · 1 like
An open-source model under the Apache-2.0 license; the listing gives no further details on its functionality.
Tags: Large Language Model, Transformers

**Bert Large Maths** · Apache-2.0 · reyvan · 330 downloads · 1 like
An open-source model under the Apache-2.0 license; no further details are provided.
Tags: Large Language Model, Transformers

**Image Captioning With Blip** · BSD-3-Clause · Vidensogende · 16 downloads · 0 likes
BLIP is a unified vision-language pretraining framework that excels at tasks such as image caption generation and supports both conditional and unconditional text generation.
Tags: Image-to-Text, Transformers

**Swissmedical Faqs Classification V1** · Apache-2.0 · FedericoDamboreana · 23 downloads · 1 like
An open-source model under the Apache-2.0 license; specific functionality depends on the underlying model type.
Tags: Large Language Model, Transformers

**Valencearousalvam** · MoroQ007 · 17 downloads · 0 likes
No model information available.
Tags: Large Language Model, Transformers

**Image Caption Large Copy** · BSD-3-Clause · Sof22 · 1,042 downloads · 10 likes
BLIP is an advanced vision-language pretraining model that excels at image captioning, making effective use of web data through its caption bootstrapping strategy.
Tags: Image-to-Text, Transformers

**Blip** · BSD-3-Clause · upro · 19 downloads · 2 likes
BLIP is an advanced vision-language pretrained model that excels at image captioning, generating accurate natural-language descriptions of image content.
Tags: Image-to-Text, Transformers

**Table Detection Detr** · Other · Christian710 · 13 downloads · 2 likes
No description is provided for this model.
Tags: Large Language Model, Transformers

**Blip Image Captioning Large** · BSD-3-Clause · movementso · 18 downloads · 0 likes
BLIP is a unified vision-language pretraining framework that excels at image caption generation and understanding, making efficient use of web data through caption bootstrapping.
Tags: Image-to-Text, Transformers

**Llava 7B Lightening V1 1** · mmaaz60 · 1,736 downloads · 10 likes
LLaVA-Lightning-7B is a multimodal model based on LLaMA-7B that achieves efficient vision-language processing through delta-parameter tuning.
Tags: Large Language Model, Transformers